How to select a good voice for TTS
Abstract
Although the perceived quality of a speaker's natural voice does not guarantee the quality of the synthesized speech, a limited set of candidates must be selected on the basis of their natural voice before synthesized sentences can be evaluated. This paper describes a male speaker selection procedure for unit selection synthesis systems in English and Japanese, based on perceptive evaluation and acoustic measurements of the speakers' natural voices. A perceptive evaluation is performed on eight professional voice talents for each language. A total of twenty native-speaker listeners are recruited in both languages, and each listener is asked to rate the speakers on eight analytical factors using a five-point scale and to rank the three best speakers. The acoustic measurement focuses on voice quality and extracts two measures from the Long Term Average Spectrum (LTAS): the so-called Speaker's Formant (SPF), which corresponds to the peak intensity between 3 kHz and 4 kHz, and the Alpha ratio (AR), which is the level difference between the 0 to 1 kHz and 1 to 4 kHz ranges. The perceptive evaluation results show a very strong correlation between the total score and the preference in both languages: 0.9183 in English and 0.8589 in Japanese. The correlations between the perceptive evaluation and the acoustic measurements are moderate for SPF and AR: 0.473 and -0.494 in English, and 0.288 and -0.263 in Japanese.
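The two LTAS measures named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the frame length, the Hann window, the plain DFT written with the standard library, and the helper names (`ltas_power`, `spf_and_alpha_ratio`, `two_tone`) are all assumptions made for illustration. In particular, the abstract does not state the sign convention of the Alpha ratio; the sketch assumes the 1 to 4 kHz band level minus the 0 to 1 kHz band level.

```python
import cmath
import math


def ltas_power(samples, sample_rate, frame_len=256, max_hz=4000.0):
    """Mean power per DFT bin (0..max_hz) over half-overlapping Hann-windowed frames."""
    hop = frame_len // 2
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / frame_len) for n in range(frame_len)]
    # Precomputed complex exponentials for a direct (non-fast) DFT.
    twiddle = [cmath.exp(-2j * math.pi * m / frame_len) for m in range(frame_len)]
    n_bins = min(frame_len // 2, int(max_hz * frame_len / sample_rate)) + 1
    power = [0.0] * n_bins
    n_frames = 0
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_len)]
        for k in range(n_bins):
            x_k = sum(frame[n] * twiddle[(k * n) % frame_len] for n in range(frame_len))
            power[k] += abs(x_k) ** 2
        n_frames += 1
    if n_frames == 0:
        raise ValueError("signal is shorter than one analysis frame")
    freqs = [k * sample_rate / frame_len for k in range(n_bins)]
    return freqs, [p / n_frames for p in power]


def to_db(power):
    """Convert linear power to decibels, with a small floor to avoid log(0)."""
    return 10.0 * math.log10(power + 1e-12)


def spf_and_alpha_ratio(freqs, mean_power):
    """Return (SPF, AR) in dB from an LTAS given as frequencies and mean power."""
    def band_power(lo_hz, hi_hz):
        return sum(p for f, p in zip(freqs, mean_power) if lo_hz <= f < hi_hz)

    # SPF: peak LTAS level between 3 kHz and 4 kHz.
    spf_db = max(to_db(p) for f, p in zip(freqs, mean_power) if 3000.0 <= f <= 4000.0)
    # AR (assumed sign): level of the 1-4 kHz band minus level of the 0-1 kHz band.
    alpha_ratio_db = to_db(band_power(1000.0, 4000.0)) - to_db(band_power(0.0, 1000.0))
    return spf_db, alpha_ratio_db


def two_tone(low_amp, high_amp, sample_rate=16000, n_samples=2048):
    """Hypothetical test signal: a 250 Hz tone plus a 3500 Hz tone."""
    return [
        low_amp * math.sin(2 * math.pi * 250.0 * n / sample_rate)
        + high_amp * math.sin(2 * math.pi * 3500.0 * n / sample_rate)
        for n in range(n_samples)
    ]


# Compare a "dull" and a "bright" synthetic signal.
for label, high_amp in (("dull", 0.1), ("bright", 0.5)):
    freqs, power = ltas_power(two_tone(1.0, high_amp), 16000)
    spf_db, ar_db = spf_and_alpha_ratio(freqs, power)
    print(f"{label}: SPF = {spf_db:.1f} dB, AR = {ar_db:.1f} dB")
```

With real recordings, the synthetic signal would be replaced by the decoded samples of a speaker's natural speech, and an FFT library would normally replace the direct DFT for speed. Absolute SPF values depend on recording level, so only comparisons between speakers recorded under matched conditions are meaningful in this sketch.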
Similar resources
Dialog speech acts and prosody: Considerations for TTS
As natural language dialog systems involving both speech recognition and text-to-speech (TTS) synthesis become more sophisticated, the limitations of general-purpose TTS for human-computer dialogs have become more apparent. Much subtlety and complexity of meaning in natural language dialogs is conveyed by prosody; how something is said is often as important as what words are spoken. At the same...
MARY TTS unit selection and HMM-based voices
This paper describes the implementation of a unit selection English voice and an HMM-based Hindi voice for our participation in the Blizzard Challenge 2013. The two voices have been created using the MARY TTS voice building framework. We describe how audiobook data is used to create the English voice and how a quality control measure (statistical model cost) is used to control the selection of uni...
Unit selection based on voice recognition
In this paper, we describe a perceptual voice recognition method to improve the naturalness of synthesized speech for a Mandarin Chinese text-to-speech (TTS) baseline system. In a large TTS speech corpus, the speech data always has different acoustic properties under different recording conditions. Speech data recorded under different conditions can ultimately influence the naturalness of synthesized...
Multilingual MARY TTS participation in the Blizzard Challenge 2009
The paper describes the Blizzard Challenge 2009 participation of MARY TTS, an open-source TTS system using a unit selection voice. We briefly outline the new language support framework we provide so that people can add support for their languages to MARY TTS, and describe how that framework was used for building a Mandarin Chinese system and voice. The system performs well for English and reaso...
How (not) to select your voice corpus: random selection vs. phonologically balanced
This paper compares the effect of two different voice corpus selection methods on the overall quality of unit selection-based text-to-speech (TTS) voices resulting from training on these corpora. The first selection method aims to maximize the coverage of stressed as well as unstressed diphones (phonologically balanced: Phonbal), while the second method simply selects sentences at random (Random)...